Human Associations Help to Detect Conventionalized Multiword Expressions
In this paper we show that to obtain human evidence about the
conventionalization of a phrase, we should ask native speakers about the
associations they have with the phrase and its component words. We show that
if the component words of a phrase frequently evoke each other as
associations, the phrase can be considered conventionalized. Another type of
conventionalized phrase can be revealed using two factors: low entropy of the
phrase's associations and low overlap between the associations of the
component words and those of the phrase. The association experiments were
performed for the Russian language.
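The two signals described above can be made concrete in a minimal sketch. The association norms, counts, and top-k threshold here are all hypothetical illustrations, not the paper's actual experimental data or procedure:

```python
from collections import Counter
from math import log2

def entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of an association response distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def mutually_associated(w1: str, w2: str, norms: dict, top_k: int = 5) -> bool:
    """True if each component word lists the other among its top-k associations."""
    top1 = [w for w, _ in norms[w1].most_common(top_k)]
    top2 = [w for w, _ in norms[w2].most_common(top_k)]
    return w2 in top1 and w1 in top2

# Toy association norms (invented counts for illustration only).
norms = {
    "red":  Counter({"tape": 40, "color": 20, "blood": 10}),
    "tape": Counter({"red": 35, "sticky": 25, "measure": 15}),
}
# A low-entropy phrase distribution: responses concentrate on few associations.
phrase_assoc = Counter({"bureaucracy": 50, "delay": 30})

print(mutually_associated("red", "tape", norms))  # → True
print(entropy(phrase_assoc))
```

A phrase whose association distribution has low entropy (responses cluster on a few dominant words) and little overlap with its components' associations would, under the paper's second criterion, also be flagged as conventionalized.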
Cross-domain opinion word extraction model
In this paper we present a new approach to domain-specific opinion word extraction in Russian. We propose a set of statistical features and a combination of algorithms that can discriminate opinion words in a particular domain. The extraction model is trained on a movie domain and then applied to four other domains. We evaluate the quality of the obtained sentiment lexicons intrinsically. Finally, our method is adapted to the movie domain in English and demonstrates comparable results.
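The abstract does not spell out the features; as a rough illustration of the general approach, here is a sketch that combines two assumed statistical features (domain-vs-reference frequency ratio and co-occurrence with a small seed lexicon) into a single score for ranking candidate opinion words. All names, weights, and counts are invented for the example:

```python
def freq_ratio(word: str, domain_freq: dict, ref_freq: dict) -> float:
    """How much more frequent the word is in the domain than in a reference corpus."""
    return domain_freq.get(word, 0) / (ref_freq.get(word, 0) + 1)

def seed_cooccurrence(word: str, cooc: dict, seeds: set) -> float:
    """Fraction of sentiment seed words the candidate co-occurs with."""
    return sum(1 for s in seeds if s in cooc.get(word, set())) / len(seeds)

def rank_candidates(words, domain_freq, ref_freq, cooc, seeds):
    """Rank candidates by an (arbitrary) equal-weight combination of the features."""
    def score(w):
        return 0.5 * freq_ratio(w, domain_freq, ref_freq) + \
               0.5 * seed_cooccurrence(w, cooc, seeds)
    return sorted(words, key=score, reverse=True)

# Toy counts (hypothetical, not from the paper's corpora).
domain_freq = {"gripping": 8, "camera": 5, "boring": 6}
ref_freq = {"camera": 9, "boring": 1}
cooc = {"gripping": {"good"}, "boring": {"bad"}, "camera": set()}
seeds = {"good", "bad"}
print(rank_candidates(["gripping", "camera", "boring"], domain_freq, ref_freq, cooc, seeds))
```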
Attention-Based Neural Networks for Sentiment Attitude Extraction using Distant Supervision
In the sentiment attitude extraction task, the aim is to identify
sentiment relations between entities mentioned in text. In
this paper, we provide a study on attention-based context encoders in the
sentiment attitude extraction task. For this task, we adapt attentive context
encoders of two types: (1) feature-based; (2) self-based. In our study, we
utilize the corpus of Russian analytical texts RuSentRel and automatically
constructed news collection RuAttitudes for enriching the training set. We
consider the problem of attitude extraction as two-class (positive, negative)
and three-class (positive, negative, neutral) classification tasks for whole
documents. Our experiments with the RuSentRel corpus show that the three-class
classification models that employ the RuAttitudes corpus for training gain a
10% increase in F1, with an extra 3% when the model architectures include the
attention mechanism. We also analyze attention weight distributions depending
on the term type. Comment: 10 pages, 9 figures. The preprint of an article published in the
proceedings of the 10th International Conference on Web Intelligence, Mining
and Semantics (WIMS 2020). The final authenticated publication is available
online at https://doi.org/10.1145/3405962.3405985. arXiv admin note:
substantial text overlap with arXiv:2006.1160
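The abstract does not detail the attentive encoders themselves. As a rough illustration of the feature-based variant, the sketch below weights context term vectors by their similarity to a query vector (standing in for an attitude participant's embedding) before pooling. The vectors, dimensions, and names are invented for the example, not taken from the paper:

```python
from math import exp

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attentive_pool(context_vecs, query_vec):
    """Score each context term by dot product with the query, normalize the
    scores with softmax, and return the weighted average of the term vectors."""
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in context_vecs]
    weights = softmax(scores)
    dim = len(context_vecs[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, context_vecs))
              for i in range(dim)]
    return pooled, weights

# Toy 3-term context with hypothetical 2-d embeddings; the query stands in
# for one of the entities between which an attitude is extracted.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
entity_query = [1.0, 0.2]
pooled, weights = attentive_pool(context, entity_query)
print([round(w, 3) for w in weights])
```

The term most similar to the query receives the largest weight; inspecting these weights per term type is the kind of analysis the abstract mentions.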
An Approach to New Ontologies Development: Main Ideas and Simulation Results
In this paper we consider a technology for developing ontologies for new domains. We discuss the main principles of ontology development, automatic methods for extracting terms from domain texts, and the types of ontology relations.
RuSentNE-2023: Evaluating Entity-Oriented Sentiment Analysis on Russian News Texts
The paper describes the RuSentNE-2023 evaluation devoted to targeted
sentiment analysis in Russian news texts. The task is to predict sentiment
towards a named entity in a single sentence. The dataset for RuSentNE-2023
evaluation is based on the Russian news corpus RuSentNE, which has rich
sentiment-related annotation: the corpus is annotated with named entities and
with sentiments towards these entities, along with related effects and emotional
states. The evaluation was organized using the CodaLab competition framework.
The main evaluation measure was the macro-averaged F-measure over the positive
and negative classes. The best result achieved was a 66% macro F-measure
(Positive+Negative classes). We also tested ChatGPT on the test set from our
evaluation and found that its zero-shot answers reached 60% F-measure, which
corresponds to 4th place in the evaluation and can be considered quite high
for a zero-shot application. ChatGPT also provided detailed explanations of
its conclusions. Comment: 12 pages, 5 tables, 3 figures
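The evaluation measure above can be sketched directly: per-class F1 for the positive and negative classes, macro-averaged, with the neutral class excluded from the average. The toy labels below are invented for illustration:

```python
def f1(gold, pred, label):
    """F1 for a single class from gold and predicted label sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1_pos_neg(gold, pred):
    """Macro-average F1 over the positive and negative classes only;
    neutral predictions affect precision/recall but not the average itself."""
    return (f1(gold, pred, "pos") + f1(gold, pred, "neg")) / 2

# Toy gold/predicted labels (hypothetical).
gold = ["pos", "neg", "neu", "pos", "neg"]
pred = ["pos", "neu", "neu", "neg", "neg"]
print(round(macro_f1_pos_neg(gold, pred), 3))  # → 0.583
```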
TatWordNet: A Linguistic Linked Open Data-Integrated WordNet Resource for Tatar
We present the first release of TatWordNet (http://wordnet.tatar), a wordnet resource for Tatar. TatWordNet has been constructed by the combination of the expand and the merge approaches. The synsets of TatWordNet have been compiled by: (i) the automatic conversion of concepts of TatThes, a socio-political Tatar; (ii) semi-automatic translation of synsets of RuWordNet, a wordnet resource for Russian with the followed manual verification and correction; (iii) manual translation of base RuWordNet synsets; (iv) and manual translation of the all hypernyms of the previously translated RuWordNet synsets. The currents version of TatWordNet contains 18,583 synsets, 36,540 lexical entries and 49,525 senses. The resource has been published to the Linguistic Linked Open Data cloud and interlinked with the Global WordNet Grid
RUSSE'2018 : a shared task on word sense induction for the Russian language
The paper describes the results of the first shared task on word sense induction (WSI) for the Russian language. While similar shared tasks were conducted in the past for some Romance and Germanic languages, we explore the performance of sense induction and disambiguation methods for a Slavic language that shares many features with other Slavic languages, such as rich morphology and free word order. The participants were asked to group contexts of a given word in accordance with its senses, which were not provided beforehand. For instance, given the word “bank” and a set of contexts with this word, e.g. “bank is a financial institution that accepts deposits” and “river bank is a slope beside a body of water”, a participant was asked to cluster such contexts into a number of clusters, unknown in advance, corresponding to, in this case, the “company” and the “area” senses of the word “bank”. For the purpose of this evaluation campaign, we developed three new evaluation datasets based on sense inventories with different sense granularity. The contexts in these datasets were sampled from Wikipedia texts, the academic corpus of Russian, and an explanatory dictionary of Russian. Overall, 18 teams participated in the competition, submitting 383 models. Multiple teams managed to substantially outperform competitive state-of-the-art baselines from previous years based on sense embeddings.
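The clustering task in the “bank” example can be illustrated with a deliberately simple sketch: bag-of-words context vectors, Jaccard similarity, and a greedy two-cluster assignment seeded by the most dissimilar pair. This is only a toy baseline for the task definition, not any participant's system (real submissions used sense embeddings and stronger clustering):

```python
STOP = {"bank", "is", "a", "the", "of", "with"}  # tiny ad-hoc stoplist

def bow(context: str) -> set:
    """Bag-of-words set for a context, dropping the target word and stopwords."""
    return {w for w in context.lower().split() if w not in STOP}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_two(contexts):
    """Greedy 2-clustering: seed with the two least similar contexts, then
    assign each context to the nearer seed by Jaccard similarity."""
    vecs = [bow(c) for c in contexts]
    pairs = [(jaccard(vecs[i], vecs[j]), i, j)
             for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    _, s1, s2 = min(pairs)  # least-similar pair becomes the seeds
    return [0 if jaccard(v, vecs[s1]) >= jaccard(v, vecs[s2]) else 1
            for v in vecs]

contexts = [
    "bank is a financial institution that accepts deposits",
    "the bank approved a loan for deposits and savings",
    "river bank is a slope beside a body of water",
    "we walked along the river bank near the water",
]
print(cluster_two(contexts))  # → [0, 0, 1, 1]
```

The first two contexts share “deposits” and land in the “company” cluster; the last two share “river”/“water” and land in the “area” cluster. Real WSI systems must also induce the number of clusters, which this sketch fixes at two.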